Text Analysis

Sentiment Analysis

Processing Text

Before we can even begin to dive into analyzing text, we must first process the text. Processing text involves several steps that will be combined in various ways, depending on what we are trying to accomplish.

Stemming

Tense aside, are chewed, chew, and chewing the same thing? Yes, but what if we compare the actual strings? As strings, are they the same? No – we have strings with 6, 4, and 7 characters, respectively.

What if we remove the suffixes “ed” and “ing”? We are left with three instances of “chew” – something that is equivalent both in meaning and as strings. This is the goal of stemming.
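A naive version of this idea is easy to sketch in base R (real stemmers, like the Porter algorithm we will use next, are far more careful about when a suffix can safely be stripped):

```r
# Naive suffix stripping -- just the core idea, not the Porter algorithm
naiveStem <- function(words) {
  sub("(ing|ed)$", "", words)
}

naiveStem(c("chewed", "chew", "chewing"))
# [1] "chew" "chew" "chew"
```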

Let’s take a look to see how this works (you will need to install tm and SnowballC first):


chewStrings <- c("chew", "chewing", "chewed", "chewer")

tm::stemDocument(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

We got exactly what we expected, right? You might have noticed that “chewer” did not get stemmed. Do you have any idea why? Let’s think through it together. “Chew”, “chewing”, and “chewed” are all verbs related to the act of chewing. “Chewer”, on the other hand, is a person who chews – it is a noun. Martin Porter’s stemming algorithm works incredibly well!

Hopefully, this makes conceptual sense; however, we also need to understand why we need to do it. In a great many text-based methods, we are going to create a matrix that keeps track of every term (i.e., word) in every document – this is known as a document-term matrix. If we know that “chew”, “chewing”, and “chewed” all refer to the same thing, we want it represented just once within our document-term matrix.
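Conceptually, a document-term matrix is nothing more than a cross-tabulation of documents against tokens. Here is a minimal base R sketch using `table()` on two toy documents, with no real preprocessing:

```r
docs <- c("chew chew gum", "chew slowly")
tokenList <- strsplit(docs, " ")

# One row per document, one column per term, cells hold counts
dtm <- table(
  doc  = rep(seq_along(docs), lengths(tokenList)),
  term = unlist(tokenList)
)
dtm
```

Document 1 has “chew” counted twice; packages like tm build the same structure with proper tokenization, weighting options, and sparse storage.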

Shall we take a look?


library(tm)

documents <- c("I like to chew", 
              "I have chewed my whole life", 
              "Chewing and stomping through the fields", 
              "I am a chewer")

documentsCorp <- tm::SimpleCorpus(VectorSource(documents))

documentsDTM <- DocumentTermMatrix(documentsCorp)

inspect(documentsDTM)

<<DocumentTermMatrix (documents: 4, terms: 13)>>
Non-/sparse entries: 13/39
Sparsity           : 75%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs and chew chewed chewing fields have life like stomping whole
   1   0    1      0       0      0    0    0    1        0     0
   2   0    0      1       0      0    1    1    0        0     1
   3   1    0      0       1      1    0    0    0        1     0
   4   0    0      0       0      0    0    0    0        0     0

We can see that without stemming, we have 13 terms (very short words like “I”, “a”, and “to” get dropped automatically – by default, DocumentTermMatrix only keeps words of at least 3 characters). Let’s do some stemming now:


documentsStemmed <- stemDocument(documents)

documentsStemmed

[1] "I like to chew"                  
[2] "I have chew my whole life"       
[3] "Chew and stomp through the field"
[4] "I am a chewer"                   

And now the document-term matrix:


stemmedDocCorp <- tm::SimpleCorpus(VectorSource(documentsStemmed))

stemmedDocDTM <- DocumentTermMatrix(stemmedDocCorp)

inspect(stemmedDocDTM)

<<DocumentTermMatrix (documents: 4, terms: 11)>>
Non-/sparse entries: 13/31
Sparsity           : 70%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs and chew field have life like stomp the through whole
   1   0    1     0    0    0    1     0   0       0     0
   2   0    1     0    1    1    0     0   0       0     1
   3   1    1     1    0    0    0     1   1       1     0
   4   0    0     0    0    0    0     0   0       0     0

If we are trying to find documents that cover similar content, this document-term matrix will help us draw better conclusions: the first three documents are all talking about the act of chewing, and the stemmed matrix now reflects that.

Lemmatization

Stemming is often sufficient (and most modern stemmers work pretty well on their own). Still, stemming is slightly more akin to amputating an arm with a battle axe – it works, but it is brute force. Lemmatization is a more sophisticated approach. As the name suggests, lemmatization finds the lemma of a word: its canonical form. A group of word forms expressing the same idea is called a lexeme (“am”, “be”, and “are” all belong to the same lexeme), and the smallest form of the word is generally chosen as the lemma. This is a really interesting area of linguistics, but we don’t need to dive fully in.
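Mechanically, a lemmatizer is largely a dictionary lookup (textstem, for example, consults a large lemma table from the lexicon package). A toy base R sketch with a tiny hand-made table:

```r
# A tiny hand-made lemma table -- real lemmatizers use tens of thousands of entries
lemmas <- c(am = "be", are = "be", is = "be", was = "be", better = "good")

lemmatizeNaive <- function(words) {
  found <- lemmas[words]
  # fall back to the original word when it is not in the table
  unname(ifelse(is.na(found), words, found))
}

lemmatizeNaive(c("am", "are", "chew", "better"))
# [1] "be"   "be"   "chew" "good"
```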

Instead, let’s see it in action.

If we compare some “chewing” stuff on stemming and lemmatization, we can see what we get:


library(textstem)

chewStrings <- c("chew", "chewing", "chewed", "chewer")

stem_words(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

lemmatize_words(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

Absolutely nothing different – both performed the same task here. “Chew”, “chewed”, and “chewing” are just tense variants sharing the lemma chew; “chewer” is still seen as something else entirely.

But let’s take a look at something different. If we have a string of the most lovely words, what might happen?


lovelyString <- c("lovely", "lovelier", "loveliest")

stem_words(lovelyString)

[1] "love"      "loveli"    "loveliest"

That is about as close to “bigly” nonsense as we could possibly get without going into Dr. Seuss mode.

But if we try lemmatization:


lemmatize_words(lovelyString)

[1] "lovely" "lovely" "lovely"

We get something that starts to make sense. Now, let’s try these on some actual chunks of text and see what happens.


# This data is in the "data" folder on Sakai!

if(Sys.info()["sysname"] == "Darwin") {
  load("~/courses/unstructured/data/allLyricsDF.RData")
} else {
  load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/allLyricsDF.RData")
}

sampleLyrics <- allLyricsDF[40, ]

sampleLyrics$lyrics

[1] \n          \n            \n            [Verse 1]\nShe loved him like he was\nThe last man on Earth\nGave him everything she ever had\nHe'd break her spirit down\nThen come loving up on her\nGive a little, then take it back\nShe'd tell him about her dreams\nHe'd just shoot 'em down\nLord he loved to make her cry\n"You're crazy for believing\nYou'll ever leave the ground"\nHe said, "Only angels know how to fly"\n[Chorus]\nAnd with a broken wing\nShe still sings\nShe keeps an eye on the sky\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her fly\n[Verse 2]\nOne Sunday morning\nShe didn't go to church\nHe wondered why she didn't leave\nHe went up to the bedroom\nFound a note by the window\nWith the curtains blowin' in the breeze\n[Chorus]\nAnd with a broken wing\nShe still sings\nShe keeps an eye on the sky\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her fly\n[Outro]\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her\nFlyyyyyyyyyyyyy\nWith a broken wing\n\n\n            \n          \n        
3106 Levels: \n          \n            \n            I'll need time to get you off my mind\nAnd I may sometimes bother you\nTry to be in touch with you\nEven ask too much of you from time to time\nNow and then\nLord, you know I'll need a friend\nAnd 'till I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\nI'll get by, but no matter how I try\nThere'll be times that you'll know I'll call\nChances are my tears will fall\nAnd I'll have no pride at all, from time to time\nBut they say, oh, there'll be a brighter day\nBut 'til then I lean on you\nThat's all I mean to do\n'Til I can make it on my own\nSurely someday I'll look up and see the morning sun\nWithout another lonely night behind me\nThen I'll know I'm over you and all my crying's done\nNo more hurtin' memories can find me\nBut 'til then\nLord, You know I'm gonna need a friend\n'Til I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\n'Til I can make it on my own\n\n\n            \n          \n         ...

Of course, we will need to do some cleaning on our text first:


library(dplyr)

library(stringr)

cleanLyrics <- sampleLyrics$lyrics %>% 
  str_replace_all(., "\n", " ") %>% 
  str_replace_all(., "\\[[A-Za-z]+\\s*[0-9]*]", "") %>%
  str_squish(.) %>% 
  gsub("([a-z])([A-Z])", "\\1 \\2", .)
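To see what each cleaning step is doing, here is the same pipeline applied to a toy string, using base R equivalents (`gsub` in place of `str_replace_all`, and `trimws` plus `gsub` in place of `str_squish`):

```r
messy <- "[Verse 1]\nYou'll ever leave the groundHe said"

step1 <- gsub("\n", " ", messy)                      # newlines become spaces
step2 <- gsub("\\[[A-Za-z]+\\s*[0-9]*]", "", step1)  # drop the [Verse 1] tag
step3 <- gsub("\\s+", " ", trimws(step2))            # squish extra whitespace
clean <- gsub("([a-z])([A-Z])", "\\1 \\2", step3)    # split "groundHe" apart
clean
# [1] "You'll ever leave the ground He said"
```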

You Try It!

You can take it for a spin – compare what comes from stemming and lemmatizing:

Here is something very interesting:


microbenchmark::microbenchmark(stem_strings(cleanLyrics), 
                               lemmatize_strings(cleanLyrics))

What does this tell you?

The question, then, is what you decide to do. For my money, lemmatization does a better job of getting words down to their actual meaning.

Stop Words

Some words do us very little good: articles, prepositions, and other very high-frequency words. These all need to be removed. Fortunately, you don’t have to do this on your own – a great many dictionaries exist that contain words ready for removal.
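At the token level, stopword removal is just a set operation. A minimal base R sketch (with a tiny stand-in list – the real dictionaries below are much longer):

```r
stopTerms <- c("i", "to", "the", "and")  # stand-in list, not a real dictionary
tokens <- c("i", "like", "to", "chew", "the", "gum")

# keep only the tokens that are not in the stopword list
tokens[!tokens %in% stopTerms]
# [1] "like" "chew" "gum"
```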


tm::stopwords("en")

  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"      

Removing stopwords takes little effort!


documents = c("I like to chew.", 
              "I am stompin and chewing.", 
              "Chewing is in my blood.", 
              "I am a chewer")

tm::removeWords(documents, words = stopwords("en"))

[1] "I like  chew."        "I  stompin  chewing."
[3] "Chewing    blood."    "I   chewer"          

We can even include custom stopwords:


tm::removeWords(documents, words = c("blood", stopwords("en")))

[1] "I like  chew."        "I  stompin  chewing."
[3] "Chewing    ."         "I   chewer"          

There are many different stopword lists out there, so you might want to poke around just a little bit to find something that will suit the needs of a particular project.


library(stopwords)

Applied to our previous song, here is what we would get:


tm::removeWords(cleanLyrics, words = stopwords("en"))

[1] "She loved  like   The last man  Earth Gave  everything  ever  He'd break  spirit  Then come loving    Give  little,  take  back She'd tell    dreams He'd just shoot 'em  Lord  loved  make  cry \"You're crazy  believing You'll ever leave  ground\" He said, \"Only angels know   fly\" And   broken wing She still sings She keeps  eye   sky With  broken wing She carries  dreams Man,    see  fly One Sunday morning She  go  church He wondered    leave He went    bedroom Found  note   window With  curtains blowin'   breeze And   broken wing She still sings She keeps  eye   sky With  broken wing She carries  dreams Man,    see  fly With  broken wing She carries  dreams Man,    see  Flyyyyyyyyyyyyy With  broken wing"

Now, let’s use the textclean package to handle contraction replacement:


replacedText <- textclean::replace_contraction(cleanLyrics)

tm::removeWords(replacedText, words = stopwords("en"))

[1] "She loved  like   The last man  Earth Gave  everything  ever    break  spirit  Then come loving    Give  little,  take  back She  tell    dreams   just shoot 'em  Lord  loved  make  cry \"  crazy  believing   ever leave  ground\" He said, \"Only angels know   fly\" And   broken wing She still sings She keeps  eye   sky With  broken wing She carries  dreams Man,    see  fly One Sunday morning She   go  church He wondered     leave He went    bedroom Found  note   window With  curtains blowin'   breeze And   broken wing She still sings She keeps  eye   sky With  broken wing She carries  dreams Man,    see  fly With  broken wing She carries  dreams Man,    see  Flyyyyyyyyyyyyy With  broken wing"

There are several great functions in textclean – I highly suggest you check it out.

And one final point to make:


gsub('"', "", replacedText)

[1] "She loved him like he was The last man on Earth Gave him everything she ever had he would break her spirit down Then come loving up on her Give a little, then take it back She would tell him about her dreams he would just shoot 'em down Lord he loved to make her cry you are crazy for believing you will ever leave the ground He said, Only angels know how to fly And with a broken wing She still sings She keeps an eye on the sky With a broken wing She carries her dreams Man, you ought to see her fly One Sunday morning She did not go to church He wondered why she did not leave He went up to the bedroom Found a note by the window With the curtains blowin' in the breeze And with a broken wing She still sings She keeps an eye on the sky With a broken wing She carries her dreams Man, you ought to see her fly With a broken wing She carries her dreams Man, you ought to see her Flyyyyyyyyyyyyy With a broken wing"

Text Processing Tools

There are several R packages that will help us process text. The tm package is popular and automates most of our work. You already saw the stemming and stopword removal functions, but tm is full of fun stuff and allows for one-pass text processing.


documents <- c("I like to chew.", 
              "I am stompin and chewing.", 
              "Chewing is in my blood.", 
              "I am a chewer")

documentCorp <- SimpleCorpus(VectorSource(documents))

stopWordRemoval <- function(x) {
  removeWords(x, stopwords("en"))
}

textPrepFunctions <- list(tolower,
                         removePunctuation,
                         lemmatize_strings,
                         stopWordRemoval,
                         removeNumbers,
                         stripWhitespace)

documentCorp <- tm_map(documentCorp, FUN = tm_reduce, tmFuns = textPrepFunctions)

documentCorp[1][[1]]$content

Once you get your text tidied up (or even before), you can produce some visualizations!


library(tidytext)

library(wordcloud2)

allLyricsDF %>%
  dplyr::filter(stringDistance < .2) %>% 
  dplyr::select(lyrics, returnedArtistName) %>%
  mutate(lyrics = as.character(lyrics), 
         lyrics = str_replace_all(lyrics, "\n", " "),   
         lyrics = str_replace_all(lyrics, "\\[[A-Za-z]+\\s*[0-9]*]", ""), 
         lyrics = str_squish(lyrics), 
         lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics)) %>%
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 25) %>% 
  na.omit() %>% 
  wordcloud2(shape = "cardioid")

Sentiment Analysis

Sentiment analysis is commonly used when we want to know the general feelings of what someone has written or said. It is most often seen applied to Twitter and other social media posts, but we can use it anywhere people have written or said something (product reviews, song lyrics, final statements).

Sentiment can take many different forms: positive/negative affect, emotional states, and even financial contexts.

Let’s take a peek at some simple sentiment analysis.

Simple Sentiment

Let’s consider the following statements:


library(tidytext)

statement <- "I dislike beer, but I really love the shine."

tokens <- tibble(text = statement) %>% 
  unnest_tokens(tbl = ., output = word, input = text)

tokens

# A tibble: 9 x 1
  word   
  <chr>  
1 i      
2 dislike
3 beer   
4 but    
5 i      
6 really 
7 love   
8 the    
9 shine  
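For intuition, the tokenization above amounts to lowercasing and splitting on non-word characters; here is a rough base R approximation (unnest_tokens handles far more cases than this):

```r
statement <- "I dislike beer, but I really love the shine."

# lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(statement), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]  # drop any empty strings from the split
tokens
```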

Now, we can compare the tokens within our statement to some pre-defined dictionary of positive and negative words.


library(tidyr)

tokens %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  pivot_wider(values_from = n, names_from = sentiment) %>% 
  mutate(sentiment = positive - negative)

# A tibble: 1 x 3
  negative positive sentiment
     <int>    <int>     <int>
1        1        2         1

When we use the Bing lexicon, we see that we get two positive words (“love” and “shine”) and one negative word (“dislike”), for an overall sentiment of 1 – slightly positive (a sentiment of 0 would indicate neutrality, while anything above 0 has an increasing amount of positivity and anything below 0 has an increasing amount of negativity).

Do you think that “dislike” and “love” are of the same magnitude? If I had to make a wild guess, I might say that love is stronger than dislike. Let’s switch out our sentiment library to get something with a little better notion of polarity magnitude.


tokens %>%
  inner_join(get_sentiments("afinn"))

# A tibble: 2 x 2
  word    value
  <chr>   <dbl>
1 dislike    -2
2 love        3

Now this looks a bit more interesting! “Love” has a stronger positive polarity than “dislike” has negative polarity. So, we could guess that we would have some positive sentiment.

If we divide the sum of our word sentiment values by the number of words that matched the dictionary, we get an idea of our sentence’s overall sentiment.


tokens %>%
  inner_join(get_sentiments("afinn")) %>% 
  summarize(n = nrow(.), sentSum = sum(value)) %>% 
  mutate(sentiment = sentSum / n)

# A tibble: 1 x 3
      n sentSum sentiment
  <int>   <dbl>     <dbl>
1     2       1       0.5

Our sentiment of .5 tells us that our sentence is positive, even if only slightly so.

While these simple sentiment analyses provide decent measures of the sentiment of our text, we are ignoring big chunks of it by just counting keywords.

For example, it is probably fair to say that “really love” is stronger than just “love”. We might want to switch over to some techniques that consider n-grams and other text features to calculate sentiment.

Smarter Sentiment Analysis

When we use sentiment analysis that is aware of context, valence (“love” is stronger than “like”), modifiers (e.g., “really love”), and adversative statements (“but,…”, “however,…”), we get a better idea about the real sentiment of the text.
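To make those moving parts concrete, here is a toy scorer with made-up weights – not sentimentr’s actual algorithm – that amplifies a polarized word when it follows “really” and downweights the clause before an adversative “but”:

```r
# Invented weights, for illustration only
polarity  <- c(dislike = -1, love = 0.75)
amplifier <- c(really = 0.8)

scoreSentence <- function(words) {
  scores <- unname(polarity[words])
  scores[is.na(scores)] <- 0
  # amplify a polarized word when the previous word is an amplifier
  for (i in seq_along(words)[-1]) {
    if (words[i - 1] %in% names(amplifier) && scores[i] != 0) {
      scores[i] <- scores[i] * (1 + amplifier[[words[i - 1]]])
    }
  }
  # an adversative "but" halves everything said before it
  butAt <- match("but", words)
  if (!is.na(butAt) && butAt > 1) {
    scores[seq_len(butAt - 1)] <- scores[seq_len(butAt - 1)] * 0.5
  }
  sum(scores) / sqrt(length(words))  # scale by sentence length
}

scoreSentence(c("i", "dislike", "beer", "but",
                "i", "really", "love", "the", "shine"))
```

The shape of the calculation follows sentimentr’s approach (a sum of weighted valences scaled by sentence length), but the specific weights here are invented.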

We will use the jockers sentiment lexicon, but many more are available. Depending on your exact needs, there are dictionaries designed for different applications.

Before we engage in our whole sentiment analysis, let’s take a look at a few things.

Here is the dictionary that jockers will use.


lexicon::hash_sentiment_jockers

                 x     y
    1:     abandon -0.75
    2:   abandoned -0.50
    3:   abandoner -0.25
    4: abandonment -0.25
    5:    abandons -1.00
   ---                  
10734:     zealous  0.40
10735:      zenith  0.40
10736:        zest  0.50
10737:      zombie -0.25
10738:     zombies -0.25

You might want to use View() to get a complete look at what is happening in there.

We should also take a peek at our valence shifters:


lexicon::hash_valence_shifters

              x y
  1: absolutely 2
  2:      acute 2
  3:    acutely 2
  4:      ain't 1
  5:       aint 1
 ---             
136:    whereas 4
137:      won't 1
138:       wont 1
139:   wouldn't 1
140:    wouldnt 1

With all of that out of the way, let’s get down to the matter at hand:


library(sentimentr)

library(lexicon)

library(magrittr)

statement = "I dislike beer, but I really love the shine."

sentiment(statement, polarity_dt = lexicon::hash_sentiment_jockers)

   element_id sentence_id word_count sentiment
1:          1           1          9    0.9375

We can see that we get a much stronger sentiment score when we include more information within the sentence. While the first part of our sentence starts out with a negative word (“dislike” has a sentiment value of -1), we have an adversative “but” that downweights the initial phrase, followed by the amplified (from “really”, with a default weight of .8) sentiment of “love” (with a weight of .75 in our dictionary).

With all of this together, we get a much better idea about the sentiment of our text.

There are also some handy functions within sentimentr:


extractedTerms <- extract_sentiment_terms(statement, polarity_dt = lexicon::hash_sentiment_jockers)

attributes(extractedTerms)$counts

   words polarity n
1:  <NA>        0 1

attributes(extractedTerms)$elements

   element_id sentence_id words polarity
1:          1           1  <NA>        0

Back To The Music

While the text that we have used so far serves its purpose as an example quite well, we can always take a look at other written words.


load(url("https://raw.githubusercontent.com/saberry/courses/master/hash_sentiment_vadar.RData"))

cleanLyrics <- allLyricsDF %>%
  filter(stringDistance < .2) %>% 
  dplyr::select(lyrics, returnedArtistName, returnedSong) %>%
  mutate(lyrics = as.character(lyrics), 
         lyrics = str_replace_all(lyrics, "\n", " "),   
         lyrics = str_replace_all(lyrics, "(\\[.*?\\])", ""), # look different?
         lyrics = str_squish(lyrics), 
         lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics))

songSentiment <- sentiment(get_sentences(cleanLyrics), 
          polarity_dt = hash_sentiment_vadar) %>% 
  group_by(returnedSong) %>% 
  summarize(meanSentiment = mean(sentiment)) # Check sentimentr for better options!

Naturally, we would want to join those sentiment values up with our original data:


cleanLyrics <- left_join(cleanLyrics, songSentiment, by = "returnedSong")

From here, we have several choices in front of us. One, we could use those sentiment values within a model (e.g., we might want to predict charting position). Or, we could use them for some further exploration:


library(DT)

sentimentBreaks = c(-1.7, -.5, 0, .5, 1.7)

breakColors = c('rgb(178,24,43)','rgb(239,138,98)','rgb(253,219,199)','rgb(209,229,240)','rgb(103,169,207)','rgb(33,102,172)')

datatable(cleanLyrics, rownames = FALSE, 
              options = list(pageLength = 15, escape = FALSE, 
                             columnDefs = list(list(targets = 1, visible = FALSE)))) %>% 
  formatStyle("lyrics", "meanSentiment", backgroundColor = styleInterval(sentimentBreaks, breakColors))

We can also do some checking over time:


library(ggplot2)

if(Sys.info()["sysname"] == "Darwin") {
  load("~/courses/unstructured/data/countryTop50.RData")
} else {
  load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/countryTop50.RData")
}


allTop50 <- allTop50 %>% 
  group_by(song) %>% 
  slice(1)

cleanLyrics <- left_join(cleanLyrics, allTop50, by = c("returnedSong" = "song"))

cleanLyrics %>% 
  group_by(date) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(date, meanSentiment)) + 
  geom_point() +
  theme_minimal()

That is pretty messy (but I am curious about that really happy month), so let’s try something else:


library(gganimate)

animatedYears <- cleanLyrics %>% 
  mutate(year = lubridate::year(date), 
         month = lubridate::month(date)) %>% 
  group_by(year, month, date) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(as.factor(month), meanSentiment, color = meanSentiment)) + 
  geom_point() +
  scale_color_distiller(type = "div") +
  theme_minimal() +
  transition_states(year,
                    transition_length = length(1975:2019),
                    state_length = 3) +
  ggtitle('Year: {closest_state}')

animate(animatedYears, fps = 5)


cleanLyrics %>% 
  mutate(year = lubridate::year(date)) %>% 
  group_by(year) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(year, meanSentiment)) + 
  geom_col() +
  theme_minimal()

Other Text Fun

Sentiment analysis is always a handy tool to have around. You might also want to explore other descriptive aspects of your text.

The koRpus package allows for all kinds of interesting descriptives. There are a great number of readability and lexical diversity statistics (Fucks is likely my favorite).

We need to tokenize our text in a manner that will please koRpus.


library(koRpus)

readability(
  tokenize(
    cleanLyrics$lyrics[27], format = "obj", lang = "en"
  ), quiet = TRUE
)

Automated Readability Index (ARI)
  Parameters: default 
       Grade: 183.5 


Coleman-Liau
  Parameters: default 
         ECP: 67% (estimted cloze percentage)
       Grade: 4.63 
       Grade: 4.63 (short formula)


Danielson-Bryan
  Parameters: default 
         DB1: 37.19 
         DB2: -246.93 
       Grade: >= 13 (college) 


Dickes-Steiwer's Handformel
  Parameters: default 
         TTR: 0.35 
       Score: 34.37 


Easy Listening Formula
  Parameters: default 
      Exsyls: 59 
       Score: 59 


Farr-Jenkins-Paterson
  Parameters: default 
          RE: -279.3 
       Grade: >= 16 (college graduate) 


Flesch Reading Ease
  Parameters: en (Flesch) 
          RE: -274.56 
       Grade: >= 16 (college graduate) 


Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 145.21 
         Age: 150.21 


Gunning Frequency of Gobbledygook (FOG)
  Parameters: default 
       Grade: 151.22 


FORCAST
  Parameters: default 
       Grade: 7.35 
         Age: 12.35 


Fucks' Stilcharakteristik
       Score: 1315 
       Grade: 36.26 


Linsear Write
  Parameters: default 
  Easy words: 98.94 
  Hard words: 1.06 
       Grade: 192.5 


Läsbarhetsindex (LIX)
  Parameters: default 
       Index: 382.04 
      Rating: very difficult 
       Grade: > 11 


Neue Wiener Sachtextformeln
  Parameters: default 
       nWS 1: 60.26 
       nWS 2: 61.54 
       nWS 3: 71.02 
       nWS 4: 98.73 


Readability Index (RIX)
  Parameters: default 
       Index: 19 
       Grade: > 12 (college) 


Simple Measure of Gobbledygook (SMOG)
  Parameters: default 
       Grade: 14.55 
         Age: 19.55 


Strain Index
  Parameters: default 
       Index: 132 


Kuntzsch's Text-Redundanz-Index
  Parameters: default 
 Short words: 318 
 Punctuation: 36 
     Foreign: 0 
       Score: 39.55 


Tuldava's Text Difficulty Formula
  Parameters: default 
       Index: 6.92 


Wheeler-Smith
  Parameters: default 
       Score: 590 
       Grade: > 4 

Text language: en 

For other lexical diversity measures, you can check out the quanteda package:


library(quanteda)

tokens(cleanLyrics$lyrics[27]) %>%
    textstat_lexdiv(measure = c("TTR", "CTTR", "K"))

  document      TTR     CTTR        K
1    text1 0.374269 4.894202 205.5333

For TTR, consider 0 to be the same word on repeat and 1 to be no word repetition.
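TTR itself is trivial to compute by hand – unique types divided by total tokens:

```r
tokens <- c("chew", "chew", "gum", "chew", "slowly")

# 3 unique types ("chew", "gum", "slowly") over 5 tokens
ttr <- length(unique(tokens)) / length(tokens)
ttr
# [1] 0.6
```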


library(rvest)

choctawBingoLyrics <- read_html("https://genius.com/James-mcmurtry-choctaw-bingo-lyrics") %>% 
  html_nodes(".lyrics") %>% 
  html_text() %>% 
  str_replace_all(., "\n", " ") %>% 
  str_replace_all(., "\\[\\w+\\s*\\w*\\]", "") %>% 
  str_squish(.) %>% 
  gsub("([a-z])([A-Z])", "\\1 \\2", .)

tokens(choctawBingoLyrics) %>%
    textstat_lexdiv(measure = c("TTR", "CTTR", "K"))

  document       TTR     CTTR        K
1    text1 0.4774775 8.713146 87.24941

Embeddings

English is hard and seeing how words relate to other words can be tricky. Let’s think through something weird:

\[\text{church} - \text{jesus} + \text{muhammad} = \text{mosque}\]

Or how about:

\[\text{Washington D.C.} - \text{America} + \text{Mexico} = \text{Mexico City}\]

What is the purpose here? Word embeddings start to capture how words can be different yet still occupy the same contextual space. Whether we are talking about a church or a mosque, we are dealing with a place of worship. If we have two different text documents (one talking about mosques and one talking about churches), it would be nice to be able to recognize that they are largely talking about the same idea.


library(text2vec)

links <- c("https://en.wikipedia.org/wiki/Christianity",
           "https://en.wikipedia.org/wiki/Islam",
           "https://en.wikipedia.org/wiki/Muslims",
           "https://en.wikipedia.org/wiki/Jesus",
           "https://en.wikipedia.org/wiki/Muhammad",
           "https://en.wikipedia.org/wiki/Quran",
           "https://en.wikipedia.org/wiki/Bible")

allText <- lapply(links, function(x) {
  read_html(x) %>%
  html_nodes("p") %>%
  html_text() %>%
  gsub("\\[[0-9]*\\]|[[:punct:]]", " ", .) %>%
  stringr::str_squish(.) %>%
  tolower(.) %>%
  tm::removeWords(., tm::stopwords("en"))
})

tokens <- space_tokenizer(unlist(allText))

it <- itoken(tokens, progressbar = FALSE)

vocab <- create_vocabulary(it)

vocab <- prune_vocabulary(vocab, term_count_min = 20L)

vectorizer <- vocab_vectorizer(vocab)

tcm <- create_tcm(it, vectorizer, skip_grams_window = 10L)

glove = GlobalVectors$new(rank = 50, x_max = 10)

glove_main <- glove$fit_transform(tcm, n_iter = 250, quiet = TRUE)

INFO  [05:26:35.292] epoch 1, loss 0.1697 
INFO  [05:26:35.338] epoch 2, loss 0.0916 
INFO  [05:26:35.369] epoch 3, loss 0.0710 
... (epochs 4 through 248 omitted) ...
INFO  [05:26:41.323] epoch 249, loss 0.0035 
INFO  [05:26:41.347] epoch 250, loss 0.0035 

# text2vec also learns a second set of "context" vectors; summing the main
# matrix with the transposed context matrix typically gives better vectors
glove_context <- glove$components

word_vectors <- glove_main + t(glove_context)

# An analogy task: "bible" is to "jesus" as ??? is to "muhammad" --
# subtract the "jesus" vector from "bible", then add the "muhammad" vector
churchVec <- word_vectors["bible", , drop = FALSE] -
  word_vectors["jesus", , drop = FALSE] +
  word_vectors["muhammad", , drop = FALSE]

# Cosine similarity between every word vector and our analogy vector
cos_sim <- sim2(x = word_vectors, y = churchVec, method = "cosine", norm = "l2")

head(sort(cos_sim[, 1], decreasing = TRUE), 10)

 muhammad     bible     mecca      banu     books        ce    medina 
0.5809943 0.4536537 0.4057872 0.3936433 0.3564263 0.3491453 0.3110081 
      old   prophet     greek 
0.3096064 0.3033618 0.3023961
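We can run the same kind of similarity query against a single word rather than an analogy vector. A minimal sketch, assuming the `word_vectors` matrix from above ("mecca" appears in our results, so we know it survived vocabulary pruning):

```r
# Nearest neighbors of a single word: cosine similarity between "mecca"
# and every other word vector, then take the top 5
meccaVec <- word_vectors["mecca", , drop = FALSE]

meccaSim <- sim2(x = word_vectors, y = meccaVec, method = "cosine", norm = "l2")

head(sort(meccaSim[, 1], decreasing = TRUE), 5)
```

The top hit will always be the query word itself (cosine similarity of 1 with itself), so the interesting neighbors start at position two.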